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ABSTRACT 

This article discusses various methods of representing and manipulating arbitrary coverage information in two dimensions, with a 
focus on space- and time-efficiency when processing such coverages, storing them on disk, and transmitting them between computers. 
While these considerations were originally motivated by the specific tasks of representing sky coverage and cross-matching catalogues 
of astronomical surveys, they can be profitably applied in many other situations as well. 

Key words, methods: numerical 


1. Introduction 


This paper focuses on finding useful data representations for de¬ 
scribing a subset of pixels on a 2D data grid (for an example see 
Fig. 0- The boundary of this subset is sharp, which means that 
pixels can take one of exactly two values: 0 (outside the set) or 
1 (inside). The shape formed by the pixels can be arbitrary (it 
may, e.g., be disconnected and/or contain holes), but when judg¬ 
ing the merits and drawbacks of the individual representations, 
it will be assumed that in typical scenarios, the shapes will not 
be pathological (e.g. they will not consist of clouds of isolated 
pixels or have fractally convoluted boundaries). 


The notion of “usefulness” of course requires a usage sce¬ 
nario. For this work, the concrete motivation was to improve 
support for astronomical databases with typical tasks, such as 
catalogue cross-matching (finding the overlap between two, po¬ 
tentially very complicated, shapes on the celestial sphere) and 
cone searches (i.e. obtaining all database objects lying within a 
certain shape on the sphere). It was inspired by the IVOA stan¬ 
dard for multi-order coverage objects (|Boch et al.|2014[ Fernique 


et al.|2015l l and tries to evaluate potential shortcomings, as well 
as to suggest improvements. 


The choice of describing shapes in an approximate way as 
a set of pixels on a grid rather than by analytical geometri¬ 
cal expressions was driven by efficiency concerns. Performing 
database queries using complicated analytical shapes is a diffi¬ 
cult and time-consuming task, and trading a small, tunable num¬ 
ber of false positives in the query result for much better run time 
is beneficial in many real-world situations. 


The remainder of this paper is structured as follows. Sec¬ 
tion]^ gives an in-depth description of the problem we set out 
to solve. The subsequent sections discuss advantages and draw¬ 
backs of various approaches to the individual sub-tasks and jus¬ 
tify our decision for a particular subset of them. Section [^eval¬ 
uates different ways to enumerate pixels on the grid. Section 
[^ introduces ways to incorporate resolution information into 
pixel numbers. Section [^discusses structure layouts for storage 



Fig. 1. 2D shape and its approximate discrete representation as a set of 
pixels. 


in main memory, and Section [^ introduces ones for long-term 
storage and transmission. A rough outline for an algorithm that 
generates approximate shape representations from analytical de¬ 
scriptions is given in Section [^ Section lists popular tesse- 
lations of the sphere and discusses whether our scheme can be 
applied to them, as well as explaining our choice of HEAFPix 
for implementation and testing, which in turn is the subject of 
Sectionj^ Finally, we present conclusions in Section [TO] 


2. Detailed problem description 

The task sketched above appears fairly straightforward at first 
glance, but it is certainly worthwhile to quickly present the many 
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different aspects that need to be considered when attempting a 
solution. 

2.1. Requirements 

The introduction already hinted at two desirable features of the 
final data structure: 

1. Boolean operations (merging, intersection, complement, 
tests for overlapping and containment) on shapes should be 
fast. 

2. The choice of pixel numbering should provide good locality; 
that is, typical shapes should be described by relatively few 
ranges of consecutive pixel numbers. This property speeds 
up database queries related to the shape. 

An additional obvious design goal is that the data structure 
should be reasonably compact. To be more precise: 

3. The memory requirement of the data structure should grow, 
with increasing grid resolution, at a lower rate than the num¬ 
ber of pixels covered by the shape. A realistic goal seems to 
be a growth rate proportional to the number of pixels crossed 
by the shape’s boundary. 

For “simple” shapes like circles and squares, this implies a struc¬ 
ture size proportional to the square root of the number of pixels 
in the shape. At the other extremes (highly convoluted or frag¬ 
mented shapes), the growth rate will be higher, and it approaches 
that of the total number of pixels again, but in such a scenario, 
there is little hope for optimisation either. 

In many situations the desired shape will be given by analyti¬ 
cal expressions; for example, in the case of a simple cone search, 
this would be a circle. In astronomy, more complicated analyt¬ 
ical shapes are often expressed using the STC grammar ( [Rots 
|2007| l or the MANGLE tools ( [Swanson et al.|[25o8| l. Typically 
the boundary of such shapes does not coincide with the pixel 
boundaries of the chosen grid, which leads to unavoidable inac¬ 
curacy of the discretised pixel-based representation, to a degree 
that depends on the grid’s chosen resolution. 

Since the derivation of a discretised representation from an 
analytic shape tends to be an expensive operation, and since it is 
quite common that the discretised representation is subsequently 
needed at several different resolutions, we require further that 

4. The data structure must allow quick resolution changes (by 
factors of 2" in both directions). 

For instance, going from a representation of a shape on a 64x64 
grid to one on a 32x32 grid at half resolution and vice versa 
must be a very efficient operation. The direction towards higher 
resolutions can obviously be performed without any loss of in¬ 
formation, while the other direction leads to a coarsening of the 
discretised shape and additionally raises the question of how to 
treat partially covered pixels. We require that coarsening opera¬ 
tions must allow the user to specify whether such pixels should 
be kept as part of the coarser shape or discarded. 

2.2. Grid properties 

So far, only a single 2D grid patch with arbitrary dimensions in 
both spatial directions has been considered. For practical pur¬ 
poses, this paper will focus on a slightly modified scenario. On 
the one hand, we consider not only one patch, but instead a set 
of No patches (or base pixels) of the same size, on which the 
shape representation can reside. On the other, we constrain all 
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Fig. 2. Subdivision of the sphere’s surface into quadrangular patches by 
HEALPix (left) and QuadCube (right). 


patches to the same dimensions of 2" pixels in each spatial direc¬ 
tion, where o (for “order”) is the single parameter determining 
the overall grid resolution. This constraint of patch resolutions 
to powers of 2 harmonises well with requirement 4 and has been 
adopted in practice by many pixelisation schemes. 

As a practical example, these modified grid properties allow 
a very intuitive description of the HEALPix grid ( jGorski et al.| 
2005] l with its No = 12 base pixels and its resolution param¬ 
eter Aside, which is equivalent to 2°, or QuadCube ( [White & 
Stemwedel [1992 i, with No - 6 base pixels and resolution de¬ 
scribed by the quantity p o + l (see Fig.[^. Constraining grid 
resolutions to powers of 2 may seem more draconian than it re¬ 
ally is; if necessary, grids with other dimensions can be emulated 
by extending their dimensions to the next applicable power of 2 
and by considering all pixels in the surplus area as unset. 


2.3. Accuracy of the shape representation 

As already mentioned above, discretising an analytical shape on 
a grid with finite resolution will (in most cases) introduce inaccu¬ 
racies. Using a simple illustration like Fig.[^ it is straightforward 
to derive a rough estimate for the fractional error of the discrete 
representation: 


Qxcess ^boundary , 

oc —^- d, 


A si 


shape 


1shape 


piX’ 


( 1 ) 


where Aexcess is the excess area of the discrete representation, 
Ashape and Lboundary are the original shape’s area and boundary 
length, and dp^ is the linear dimension of a grid pixel. Of course 
this estimate is only applicable on “well-behaved” grids where 
pixels have almost equal sizes and are not very elongated; oth¬ 
erwise, the linear pixel dimension would not be a well-defined 
quantity. 

The estimate illustrates that the fractional error scales lin¬ 
early with the grid resolution; thus, for a specific desired er¬ 
ror tolerance, it is possible to choose an appropriate pixel size 
whenever computing the approximate discrete representation of 
an analytical shape. Since dp^ cannot become larger than the lin¬ 
ear dimension of a base pixel, the relation above naturally holds 
only on smaller scales. Given the typically small numbers of base 
pixels in spherical pixelisation schemes - 12 for HEALPix, 6 for 
QuadCube, 8 for HTM ([Szalay et al.[2007l ), 12 for basic IGLOO 
( [Crittenden & Turok|1998[ l - this is not a real problem in most 
cases. However, for pixelisation schemes with very many base 
pixels (e.g. some of the more involved IGLOO tilings that can 
have Aq > 10 000), the coarsest possible discretisation of a shape 
may be finer (and therefore more memory-consuming) than re¬ 
quired by the application. 
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Fig. 3. Traditional linear numbering scheme. 


3. Evaluation of pixel numbering schemes 

Starting from the goals established in Section]^ the first task is to 
assign indices to the individual pixels in the 2D grid, preferably 
in a way that aids compact representation of shapes. Assuming 
No base pixels and a resolution order o, there are ripix.o := No2^ 
pixels in total on the grid, and giving them numbers from the 
interval |^0;Aio2^°|^ seems a reasonable choice. Analogously, it 
is most likely a good idea to group pixels belonging to the 
same patch together, i.e. associating indices |^ 0 ; 2^°^ with the first 
patch, indices |^2^°; 2 x 2^°|^ with the second patch, etc. Obvi¬ 
ously this choice of pixel numbers only identifies a pixel within 
the grid at a specified order; for an approach to encoding both 
order and location in the grid in a single number, see Sect. 

3.1. Linear ordering 

Several options are available, however, for the indexing scheme 
for the pixels within a patch. A very well-known and intuitive 
choice would be a linear ordering that counts first all pixels in 
the first row of the patch, then those in the second row, etc. 
(see Fig. [^. While very easy to implement, this arrangement is 
sub-optimal for the purpose of compact shape representation: the 
traversal of a full spatial dimension for every row implies bad lo¬ 
cality properties (conflicting with requirement 2). Another, less 
severe drawback is the nontriviality of resolution change oper¬ 
ations (requirement 4): while not exactly daunting, grid refine¬ 
ment and coarsening are more complicated than in other, more 
promising numbering schemes. 

3.2. Hierarchicai schemes 

Given the above-mentioned shortcomings of the most straight¬ 
forward approach to pixel numbering, it seems worthwhile to 
specifically investigate schemes that have some sort of resolu¬ 
tion hierarchy built in. One family of such schemes can be con¬ 
structed using the simple recursive rule that a pixel with the num¬ 
ber p at resolution order o must coincide with the four pixels 
[4p; 4{p + 1)[ at order o + \. 

For every numbering scheme that follows this rule, require¬ 
ment 4 is trivially fulfilled: p’s parent pixel at order 02 < o has 
the number and at higher orders 03 > o, it covers 

the range of pixels [p x (p H- 1) x All necessary 

conversions between these numbers can be performed using ex¬ 
tremely quick bit-shift operations. 

While reducing the possible numbering choices, the above 
constraint does not specify a unique ordering in itself: there is 
still the choice of how exactly the available indices are assigned 
to the four sub-pixels when going from order o to o H- 1. Since 
this choice can be made differently for each single pixel and at 
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Fig. 4. Three levels of refinement for the Z curve (left) and Peano- 
Hilbert curve (right). 


each order, the number of theoretically available numberings is 
huge, but fortunately, only a handful of these have attractive ge¬ 
ometrical and algorithmic properties. 

Figure]^ shows two well-studied variants known as the Mor¬ 
ton (or Z) and Peano-Hilbert curves. For both subdivision strate¬ 
gies, the recursive nature and self-similarity of the pixel order¬ 
ing is clearly visible. The figure also demonstrates intuitively 
that the Peano-Hilbert curve tends to have better locality than 
the Z curve, because the geometrical distance between Peano- 
Hilbert pixels n and n -H 1 is always the shortest possible one (i.e. 
the linear dimension of a single pixel), whereas the Z curve ex¬ 
hibits fairly large spatial jumps between pixels of neighbouring 
indices. 

Unfortunately, the improved locality properties of Peano- 
Hilbert ordering come at a price: converting the two-dimensional 
location {xi, y,) of a pixel in the grid at order n to its correspond¬ 
ing Peano-Hilbert index is an iterative process with n individual, 
nontrivial steps, and the same holds for the inverse operation 
( |Lam & Shapiro|fl994[ ). The same tasks for the Z curve, how¬ 
ever, can be achieved by comparatively simple bit manipulations 
and, on modern CPUs, even by using specialised machine in¬ 
structions Q 

Whether efficient index calculations should be preferred over 
locality (hence compactness of representation) or vice versa de- 

' PDEP and PEXT, see e.g. http://en.wikipedia.org/wiki/ 
Bit_Manipulation_Instruction_Sets Unfortunately, compiler 
support is not yet wide-spread. 


Article number, page 3 of|^ 

























































































































































A&A proofs: manuscript no. paper 


pends on the task at hand. It should be noted, however, that, 
among all algorithms discussed in this paper, only the initial 
generation of a pixelised shape representation requires any con¬ 
version of pixel indices. All other operations (such as resolution 
changes. Boolean operations on one or multiple shapes, or con¬ 
version to long-term storage format) are completely oblivious 
to the the chosen numbering scheme and will only benefit from 
good locality, but not suffer from slow index computations. We 
expect that once created, a given shape will usually be processed 
in queries many times, so that any slowness during creation will 
be more than amortised by the speedups of a compact represen¬ 
tation; therefore, Peano-Hilbert should be the preferred ordering. 

Anotheytossible choice for hierarchical ordering, based on 
Gray code^ should be noted for completeness. Its locality lies 


between those of Z and Peano-Hilbert curves (see, e.g.. Moon 


|et al.|200T| for benchmarks), and its index computations are about 
as complex as those of Peano-Hilbert. Consequently we will not 
investigate this option further. 


orders, this drawback can be overcome. If the 0„ are constrained 
to be powers of 2, determining the order of a given unique pixel 
number i becomes equivalent to computing [log 2 i\, which can be 
done in constant time by specific machine instructions on most 
current CPUs. The general formula for the smallest possible O,, 
that is a power of 2 for a given No reads as 

On = 2^" X [2ri°g2(4A'o/3)l _ (3) 

As a practical example, for HEALPix with No - 12, one ob¬ 
tains On - 2^"^^, which means that the 12 pixels at order 0 have 
the unique indices 4 to 15, those at order 1 have 16 to 63, etc. In¬ 
terestingly, the number of bits required to represent a HEALPix 
pixel at any order does not increase when switching to unique 
indices. 

It is also worth mentioning that in the above example and, 
generally, in all cases where On is an integer multiple of 2^", the 
convenient hierarchy property described in Section [3^ holds for 
unique pixel numbers as well. 


4. Unique multi-order indices 

Sectionj^discussed various choices for numbering pixels within 
the grid at a given resolution order. The resulting numbers do not 
carry any information about the order of the grid they refer to, 
so this quantity must be known from other sources. In some cir¬ 
cumstances it is convenient to encode this resolution information 
together with the pixel index in a single number, and this section 
discusses economical ways to achieve this goal. 

The simplest approach from an implementation standpoint 
would be to reserve a few bits of the pixel number for directly 
storing the order. This allows conversions from (order -H pixel 
index) to unique pixel number and vice versa in constant time, 
using only simple bit shift and bit masking operations. It has, 
however, the drawback that the number of bits required for stor¬ 
ing the order is [log 2 Omaxl, which is quite wasteful compared to 
alternative approaches. Potentially worse, it requires the agree¬ 
ment on a common Omax over all involved applications. 

Alternatively, it is possible to start numbering the pixel in¬ 
dices for order n from an offset On instead of 0, with the no¬ 
overlap constraint that On+\ is greater than the highest pixel num¬ 
ber at order n: 


On+l ^ On "t" n. 


piX,/7 • 


( 2 ) 


Pixel numbers can be minimised by choosing Oo 0 and 
the equality On+i := G„H-«pix,n- Using this dense packing, and be¬ 
cause in two dimensions npix,«/npix,«-i = 4, the maximum unique 
pixel index at order n is never larger than 4npix2,/3. This implies 
that the necessary bit length for the unique pixel index is at most 
one bit higher than for the standard pixel index at any resolution, 
which compares favourably to the approach described above. As 
an aside, this growth of at most one bit is also true for ID and all 
higher-dimensional scenarios. 

While conversion from (order, pixel number) pairs to unique 
indices is still a 0(1) operation in this scheme, the inverse opera¬ 
tion becomes more expensive in the general case, because deter¬ 
mining the order corresponding to a given unique pixel index can 
only be done via interval bisection in (9(log2 Omax) steps, which 
is undesirable. 

By giving up the dense packing and allowing for some un¬ 
used pixel indices between the blocks for the various resolution 


^ A method of binary representation of integer numbers with (proba¬ 
bly accidental) hierarchical properties; see https://en.wikipedia. 


org/?title=Gray_code and|Moon et al. j2001 
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5. In-memory representation 

In this section we discuss alternative data structures describing 
a shape on a chosen 2D grid, assuming one of the hierarchical 
pixel ordering schemes presented in Section [T2] The goal is to 
determine the structure most suitable for Boolean operations and 
resolution changes. Another, more specialised, format for stor¬ 
age and transmission is presented in Section]^ 

The naive choice of simply storing a list of all pixels at a 
given resolution that lie within the shape can be discarded im¬ 
mediately as impractical. The number of entries in such a list 
would always scale with 0(2^°) in direct contradiction to re¬ 
quirement 3 and would very quickly become unmanageable in 
typical scenarios. As an example, a shape covering ten percent 
of the sphere and stored at a resolution of 1 arcsec would contain 
roughly 5 ■ 10'® pixels, which does not fit into most computers’ 
main memory. 

5.1. Multi-order list 

One way to avoid the prohibitive size of a full pixel list at the 
highest resolution was suggested in the MOC standard document 
( |Boch et'ar]|2014[ ), and it exploits the hierarchical property of 
the pixel numbering scheme; instead of having a single list of 
pixels at the highest resolution Omax, a sorted list is kept for every 
resolution order 0 < o < Omax- Starting from o = 0, all pixels 
at this resolution that are completely covered by the shape are 
entered into the list for this order. Then the procedure is repeated 
for the remainder of the shape at the next higher resolution, until 
the desired resolution is reached. 

This procedure yields a very compact representation of the 
shape (called multi-order list, or MOL, below), which certainly 
fulfils requirement 3. Also, this structure facilitates resolution 
changes. Resolution upgrade is an empty operation, and coarsen¬ 
ing simply consists of removing the pixel lists above the desired 
resolution, if partially covered pixels are ignored; if they should 
be kept, the pixel lists have to be updated recursively from higher 
to lower resolutions. 

Unfortunately, multi-order lists are not well suited to 
Boolean operations. To our knowledge, no algorithm exists to 
compute, for example, the union of two shapes in (9(ni -i-n 2 ) time, 
where and n 2 are the respective total lengths of both shapes’ 
multi-order lists. Thus, the MOL representation does not satisfy 
requirement 1. 
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It should be noted that, by construction, the total size of a 
multi-order list for a given shape does not depend on the particu¬ 
lar choice of hierarchical numbering scheme; in other words, the 
MOL representation does not benefit from the improved locality 
of the Peano-Hilbert curve compared to that of the Z curve. 

5.2. Range set 

Since storing lists of pixels at various resolutions complicates 
Boolean operations, a preferable method is to simply store a 
sorted collection of ranges of pixels (characterised by their be¬ 
ginning and end) at the highest resolution. We refer to this struc¬ 
ture as a range set (RS). 

There are different ways to represent a range of numbers: 
One could use the hrst number plus the range’s length or the first 
and last numbers of the range, for instance. Mostly for reasons 
of symmetry, we are adopting a representation that is commonly 
used by programmers by storing the first number of the range 
and the first number after the range. While this choice may seem 
unintuitive, it is technically preferable in many small ways over 
the other representations M 

Even though formally residing on a high-resolution grid 
alone, the range set representation still benefits substantially 
from the hierarchical numbering scheme: each pixel at resolu¬ 
tion order o that is completely filled is represented by a single 
interval of pixel numbers at higher resolutions 02 > o, by con¬ 
struction. This allows a worst-case size estimate of the RS repre¬ 
sentation compared to MOL: assuming that no pixel ranges can 
be merged, every pixel in the MOL becomes an isolated pixel 
range in the corresponding RS. Since a range is characterised by 
two numbers instead of one, the RS representation is, at worst, 
twice as large as the MOL. 

There are also situations where the RS is smaller than the 
MOL, however: assuming a shape that covers the entire grid 
except for a single pixel at order «, the MOL will consist of 
Nq - I + hn pixels, whereas the RS will contain at most two 
ranges. Generally, the more regular a shape (i.e. the lower its 
ratio of boundary length to surface), the smaller its RS represen¬ 
tation will be. 

In practice both representations tend to have a similar size 
for compact shapes, with the RS typically being slightly smaller 
than the MOL. For very convoluted or fragmented shapes, the 
MOL representation has the advantage, and the size ratio ap¬ 
proaches the limit of 2:1 in favour of the MOL. See Sect. |9.3| for 
real-world examples. 

In contrast to multi-order lists, RSs do benefit from better 
locality properties of the underlying pixel numbering scheme, 
which means that a RS representation of a shape on a Peano- 
Hilbert-indexed grid has, on average, fewer and longer ranges 
than on a Z-curve-indexed grid. This reduction in the number of 
ranges is more pronounced for “regular” shapes and vanishes in 
the limit of complete fragmentation (see the third row of Fig. 
for real-world examples). When the pixel numbering is used to 
index entities in a database, the improved compactness of the 
Peano-Hilbert representation improves query times. 

Range sets behave very similarly to multi-order lists for reso¬ 
lution changes. Resolution increases are again empty operations, 
and decreases - whether inclusive or exclusive - require simple 


^ As an example, it gives symmetry to “on” and “off” ranges, making 
inversion operations trivial to implement. Also, this approach of de¬ 
scribing a range set leads to a strictly increasing sequence of numbers, 
which is convenient for compression (see Sect.|^. 


adjustments of range borders and potentially merging of ranges, 
which scales linearly with the number of ranges in the set. 

The substantial advantage of RS over the MOL representa¬ 
tion lies in the simplicity and efficiency of all Boolean oper¬ 
ations. This includes union and intersection of two shapes, as 
well as finding the complement of a shape and testing whether a 
shape contains, overlaps with, or is equal to another. These op¬ 
erations can be carried out in at most 0{ni + 112 ) steps, where 
rii and n 2 are the range counts in both involved shapes. For 
the subset and overlapping tests, alternative algorithms with 
the complexity (9(nminlogWmax) (with Mmin min(ni,n 2 ) and 
«max •= rnax(ni,n 2 )) may be used, which can be advantageous 
if both shapes have very different range counts. In most situa¬ 
tions, this advantage should more than balance out the drawback 
of the potentially larger RS size compared to MOL. 

In fact, the most efficient way to perform Boolean operations 
on MOL that the authors have discovered so far is to convert the 
input MOL to RSs, perform the desired operation, and convert 
the result back to MOL. (This is also the approach recommended 
by the MOC standard.) The conversions between both represen¬ 
tations have a slightly higher complexity than the Boolean op¬ 
eration itself, since they involve sorting and merging of sorted 
pixel lists at all involved resolution levels. 


6. Long-term storage and transmission format 

Section presented a data structure suitable for fast data pro¬ 
cessing. In this section the focus is shifted towards finding a 
representation of the 2D shape that does not directly support 
Boolean operations, but is even more compact, while still being 
efficiently convertible to the RS or MOL representation. This 
data structure could be used for space-efficient storage on disk 
or for quick transmission over a network. In other words, we can 
relax our computation-related requirements 1 and 4 and concen¬ 
trate mainly on size-related requirement 3 in this section. 

In this context it is very helpful to observe that both the range 
set representation and the multi-order list representation (assum¬ 
ing unique pixel indices, see Sect. of a shape take the form of 
strictly monotonous sequences of non-negative integers. In addi¬ 
tion, these integers tend to build - at least for non-pathological 
shapes - relatively compact clusters within the possible range of 
values. 

As such, these sequences are perfectly suited for binary in- 


terpolative coding (Moffat & Stuiver 


20001 , a compression al¬ 


gorithm typically used for lookup tables in search engines. This 
method provides a fairly simple and quick means to convert be¬ 
tween the sequence and a highly compressed bit stream contain¬ 
ing equivalent information. The algorithm has the complexity 
0{n) in both directions for a sequence of n elements and can 
be implemented and customised for the task at hand using no 
more than a few hundred lines of code. Compression factors typ¬ 
ically range from 2 to 10, and the time required for conversion 
is roughly comparable to the conversion time between MOL and 
RS representations. For concrete performance measurements on 
a large set of test data see Sect.j^ 


7. Generation from analytical shapes 

Creating sets of pixels that approximately represent an ideal geo¬ 
metric shape can be achieved in many different ways. We brieffy 
present an approach that poses minimal requirements on both 
the description of the shape and the grid geometry, has low com¬ 
plexity, and is conceptually easy to implement. 
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In addition to the algorithms discussed so far, only one fur¬ 
ther ingredient is required: a function which, given a specific 
pixel and order, returns whether 

- the pixel lies completely within the shape; 

- the pixel lies completely outside the shape; 

- the pixel centre lies inside the shape, but it cannot be decided 
whether the pixel lies on the boundary or is completely in¬ 
side; or 

- the pixel centre lies outside the shape, but it cannot be de¬ 
cided whether the pixel lies on the boundary or is completely 
outside. 

To obtain the desired set of pixels, the following algorithm is 
executed for all Nq base pixels: 

- test the current pixel with the function given above; 

- if it is completely inside the shape, append it to the result 
pixel list; 

- (if it is completely outside, do nothing); 

- otherwise: 

- if the order o of the current pixel is smaller than the max¬ 
imum order Omax for this query, call the algorithm recur¬ 
sively for its four sub-pixels. 

- if o = Oinax and the centre is inside the shape, append the 
pixel to the result pixel list. 

- if o = Omax and the centre is outside the shape, append the 
pixel to the result pixel list only if the query is inclusive. 

Here, an inclusive query means that all pixels potentially over¬ 
lapping with the shape are included in the result, in contrast to 
standard queries, which simply return all pixels whose centres lie 
inside the shape. While standard queries give exact results, the 
results of inclusive queries may contain a small number of false 
positives, which are pixels that do not actually overlap with the 
shape. Given the simple inside/outside criterion described above, 
this is unavoidable. 

This algorithm has the advantage of checking high- 
resolution pixels only near the shape’s boundary. Far away 
from the boundary (whether inside or outside), the status of the 
checked pixels can already be determined at low resolutions, so 
that the recursion terminates early. For reasonably compact and 
non-convoluted shapes, this leads to a complexity proportional 
to the length of the resulting RS, which is very welcome. 

The above algorithm has been used in almost exactly this 
form at the core of the query_disc and query_polygon rou¬ 
tines of the HEALPix C-i-H- library for several years and has 
proven very robust. 


8. Applicability to popular pixelisation schemes 

The techniques developed above can, in principle, be applied to 
any hierarchical grid. The benefits, however, depend on the in¬ 
dividual grid’s properties. In the following, we present some of 
the most popular spherical pixelisations and quickly discuss the 
feasibility and possible limitations of applying our techniques to 
them. 

ECP: This long-known grid with pixel centres equidistant in lat¬ 
itude and longitude can be designed to be hierarchical. How¬ 
ever, the grid cells are extremely elongated near the poles, 
and the pixel areas vary strongly, so this grid is not suited to 

our pur pose. _ 

QuadCube ( [White & Stemwedel|1992| l: With fairly uniform 
pixels and a hierarchical structure, this pixelisation should 
be quite suitable. 
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IGLOO ( [Crittenden & Turok|1998| l: This grid is hierarchical by 
design and has reasonably uniform pixel sizes and compact 
pixel shapes. It should be a suitable basis for our shape rep¬ 
resentations, but owing to the specific refinement procedure 
near the poles, it might be hard to find traversals with good 
locality^ _ 

HEALPix ( [Gorski et al.|2005] l: Designed to be hierarchical 
with exactly equal pixel areas and fairly uniform pixel 
shap es, this pixelisation is very well suited to our studies. 

GLESP ( [Doroshkevich et al.|2005| l: Its geometrical require- 
m ents mean that this g rid is inherently non-hierarchical. 

HTM ( [Szalay et al.|2007| : This grid is inherently hierarchical. 
That its pixels are triangular does not present a problem, 
since recursive refinement in factors of 4 also works in 
this case. In combination with fairly uniform pixel sizes 
and shapes, it is a suitable underlying grid for our shape- 
representation algorithms. The only real concern is the 
choice of subpixel numbering: since the subpixels in the tri¬ 
angle corners get indices 0-2, and the one in the centre gets 
index 3, the traversing curve has bad locality, resulting in 
unnecessarily large RSs. This could be mitigated by using a 
modified counting scheme, however. 

For the tests presented in this paper, we have adopted the 

HEALPix grid for the following reasons: 

- The MOC standard already specifies HEALPix as underlying 
pixelisation, making results obtained for this grid especially 
relevant. 

- Our test data set (see next section) was already provided as 
HEALPix MOC objects; re-discretisation on a different grid 
would have been possible, but cumbersome, without provid¬ 
ing additional value. 

- The existing software for HEALPix is extensive and easily 
accessible, and already contains a considerable part of the 
required functionality (e.g. the conversion routines between 
NEST and Peano-Hilbeit indices). 

- Given our involvement in the development of the HEALPix 
code, working with this software was the most efficient 
choice. 


9. Validation and performance tests 


The practical application of the algorithms and data structures 
presented above will be demonstrated in the framework of the 
HEALPix C-n- library. This is convenient, since the so-called 
NESTED ordering scheme for HEALPix pixels is equivalent to 
Z-curve ordering, and since the C-h-h library also supports Peano- 
Hilbert indexing of pixels. This was added to allow the research 
presented by Schafer ( 2005 [ I. 

Eor our validations we made use of the very extensive col¬ 
lection of astronomical sky coverages available at http:// 
alasky.u-strasbg.fr/MocServer/MocQuery which at the 
time of download consisted of 14 633 data sets, stored as multi¬ 
order lists in PITS files. These contain very small, low-resolution 
shapes as well as large, but compact, survey coverages and 
highly fragmented star catalogues at high resolutions. MOL sizes 
range from roughly ten to several million entries, and maximum 
resolution orders vary between 0 and 19 (corresponding to a lin¬ 
ear pixel dimension of ~0.5 arcsec). 


9.1. Validation 

To test the correctness of our implementation, we verified a se¬ 
ries of identities, using the shapes in our data collection. In the 
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RSZ 

RS P 

MOL 

CRSZ 

CRS P 

RSZ 

— 

489 

100 

176 

656 

RS P 

472 

— 

391 

648 

167 

MOL 

80 

386 

— 

256 

553 

CRSZ 

166 

655 

266 

— 

822 

CRSP 

632 

160 

551 

828 

— 


Table 1. Overview of average conversion times from the shape repre¬ 
sentation in the top row to the one in the first column. The abbreviations 
denote “range set in Z ordering”, “range set in Peano-Hilbert ordering”, 
“multi-order list in Z ordering”, and the respective compressed variants 
of the range sets. All times are given in microseconds. 


following, A and B denote shapes represented as a RS in Z-order 
indexing. 

The following tests involving single data sets were per¬ 
formed on all available shapes; 

- FITS input/output: 

A = fromFITS(toFITS(A)) 

- conversion to/from MOL; 

A = fromMOL(toMOL(A)) 

- compression; 

9 

A = uncompress(compress(A)) 

- MOL compression; 

9 

A = froniMol(uncompress(compress(toMOL(A)))) 

- Peano-Hilbert conversion; 

9 

A = fromPeano(toPeano(A)) 

- complement; 

— 9 — 9 

AnA = 0;AuA = entire sphere 

Furthermore, we verified the following identities for a subset of 
all possible shape pairs; 

- (a n b) n B = 0 

9 9 

- AUB3A;AUB2B 

- AnBcA;AnBcB 

-AAB = (AuB)n (AnB) = (a n b) u (a n b). 

The subset was obtained by first sorting the available shapes ac¬ 
cording to their number of ranges and picking those at positions 
Pi ;= 147/, resulting in 100 shapes sampled fairly from all avail¬ 
able complexities. All 10 000 shape pairs constructible from this 
subset were tested. 

9.2. CPU benchmarks 

All tests were performed on a single core of an Intel Core i3- 
2120 CPU running at 3.3GHz, using gcc 4.7.3 as compiler. 

9.2.1. Conversions between different representations 

Sections l^andj^discussed various data structures for represent¬ 
ing a 2D shape with the conclusion that, depending on the con¬ 
crete usage scenario and the compactness of the shape, no single 
one fits all needs perfectly. Consequently, conversions between 
different representations may occur frequently, and it is impor¬ 
tant that they can be carried out quickly. 


XOR 

A A B 

98.30 

OR 

AUB 

72.36 

ANDNOT 

AnB 

34.93 

AND 

AnB 

5.66 

overlap 

AnB = 0 

1.66 

containment 

AUB=A 

0.19 

complement 

A 

43.80 


Table 2. CPU times for various unary and binary Boolean operations, 
averaged over all possible combinations of shapes in the available data 
set. Shapes are stored as range sets in Z ordering. All times are given in 
microseconds. 


Table [2 lists the conversion times between several represen¬ 
tations, averaged over all shapes in the test data set. Obviously, 
converting from a MOL to a RS in Z ordering and vice versa 
only takes around 100 microseconds on average, which is advan¬ 
tageous, because this kind of conversion is likely needed most 
often. 

Conversions between compressed and uncompressed RSs of 
the same scheme are more expensive, but only by about a fac¬ 
tor of two. This makes compressed range sets a very attractive 
choice for shape storage whenever memory is at a premium. 

Finally, changes between the Z-curve and Peano-Hilbert 
pixel numbering schemes are fairly costly, so it is probably best 
to decide on an overall numbering scheme for each application, 
and to perform this sort of conversion only during data exchange 
with other external programs, if unavoidable. 

9.2.2. Cost of Boolean operations 

Table|^lists average execution times for the Boolean operations 
on the shapes that are supported by the library. Probably the most 
noteworthy fact is that all operations are faster than (or in the 
worst case have run times similar to) the conversions discussed 
in Sect. |9.2.1| This implies that storing the shapes using a non- 
RS representation and converting to RS whenever needed incurs 
a substantial slowdown and should therefore only be done when 
computer memory is scarce. 

Amongst the binary operations, the exclusive-or operation 
consumes the most time. This is according to expectations, since 
XOR requires an examination of every range start- and endpoint 
for both involved RSs, while shortcuts are possible for AND, 
OR, and ANDNOT. Also, a RS constructed from two others via 
XOR tends to contain more ranges than ones constructed from 
the same sources using OR or AND. 

Computing the union (OR) of two shapes is more expen¬ 
sive on average than intersection (AND) or subtraction (AND¬ 
NOT), which might be because RSs of unions are typically 
longer than those of the last two operations. That construction 
of the result requires a large amount of the algorithm’s runtime. 
The pronounced difference between the average times for AND 
and ANDNOT is most likely caused by the nature of our input 
data set; most of the shapes only cover a very small part of the 
sphere, so intersections often result in empty or very small sets, 
whereas subtractions typically reproduce the first operand and 
consequently need more time to construct their output. 

The checks whether one shape overlaps another or is com¬ 
pletely contained within another require considerably less CPU 
time. This is unsurprising, since the test can exit early as soon 
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as the first overlap (or the first non-containment) is found, and 
since both algorithms return only a single Boolean value and do 
not have to construct an output RS. Again, owing to the nature of 
the test data set, which contains many small and very often dis¬ 
joint shapes, early exit is much more likely for the containment 
tests, thereby lowering the average CPU time for this case. 


9.3. Memory benchmarks 

As already mentioned in Sect.j^ the best choice for a small rep¬ 
resentation of a shape depends on its compactness. Regular, non- 
convoluted shapes are best represented as RSs, whereas multi¬ 
order lists have the size advantage in the case of strongly frag¬ 
mented shapes. The available test data set contains representa¬ 
tives of both types, with the fragmented shapes dominating quite 
strongly. 

To demonstrate the performance of the various represen¬ 
tations more clearly, we split the available data sets into two 
groups, using the phenomenological quantity / (for fragmenta¬ 
tion) 




^ianges,Z 

^pix,max 


(4) 


Here, nranges,z denotes the number of pixel ranges in a Z-order 
RS representation of the shape, and npix.max is the number of in¬ 
dividual pixels at the maximum resolution used for the shape’s 
description. Here, / ranges from 0 (very compact shape) to 1 (ex¬ 
tremely fragmented); we use a threshold of / = 0.1 to discern be¬ 
tween compact or “survey-like” and fragmented or “catalogue¬ 
like” shapes. 

Furthermore, we exclude all “small” shapes from our size 
comparisons, whose multi-order list representation has fewer 
than 100 entries. This is done because size ratios computed for 
such shapes tend to produce extreme values, although their to¬ 
tal resource consumption is close to negligible compared to the 
other shapes in the test data set, whose multi-order lists have up 
to several million entries. 

The results of the memory benchmarks are shown in Fig. 
The results of the size comparison of range sets to their corre¬ 
sponding multi-order lists are in good agreement with the esti¬ 
mates given in Sect. |5.2| For survey-like shapes, the range set 
representation tends to be consistently smaller than the multi¬ 
order list, while for catalogue-like shapes it is larger by a factor 
of almost 2. 

The improved locality of the Peano-Hilbeit ordering clearly 
has an effect on the RSs for compact shapes, reducing their size 
by roughly 25% on average. Again, the same is not true for frag¬ 
mented shapes, because physically separated individual pixels 
do not benefit from the change in the numbering scheme overall. 

Using binary interpolative coding to compress the Z-order 
range sets is effective for both kinds of shapes, but here as well 
the benefit for compact shapes is greater. In any case the re¬ 
sults demonstrate that RSs compressed by interpolative coding 
are substantially smaller than uncompressed multi-order lists in 
almost all cases. 


10. Summary and outlook 

We have presented and evaluated different approaches to repre¬ 
senting coverage information on 2D grids, while investigating 
alternative pixel numbering schemes and data structures. The 
most promising candidates for quick data processing and long¬ 
term storage were selected and implemented as part of the C-n- 
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HEALPix package. Correctness and performance were evalu¬ 
ated using a comprehensive set of VO data. 

The new functionality has also been integrated in to the 
HEALPix Java library to allow easier use by software for 
the Virtual Observatory community, which is largely imple¬ 
mented in that language. Both implementations are available 
under the terms of the GNU General Public License v2 from 
the HEALPix Subversion repository at http: //sourceforge. 
net/pro jects/healpix and they will be part of the next offi¬ 
cial release of the HEALPix package. 

Numerical experiments performed with the code indicate 


that the recommendations given in the MOC standard (Boch 
|et al.|2014| ) concerning the data format for shapes on the sphere 
and Boolean operations between them can be improved upon 
both in terms of required memory and CPU time, although not 
generally in both simultaneously. The compression scheme pre¬ 
sented in this paper is generally more space-efficient than the 
multi-order list described in the standard, and the RS representa¬ 
tion allows quicker Boolean operations, at the cost of requiring 
potentially more memory than the multi-order list. 

Eor applications that use pixel numbers for indexing objects 
in a data base, we recommend a Peano-Hilbert-based hierarchi¬ 
cal ordering scheme instead of Z ordering, because the superior 
locality properties of the former scheme lead to better clustering 
of database accesses for queries of compact shapes. 

The compact description of arbitrary shapes on discrete grids 
using the pixel-numbering schemes and formats presented in this 
paper are by no means restricted to two spatial dimensions. Ex¬ 
tension to higher dimensions is straightforward. The only ingre¬ 
dient requiring nontrivial changes are the hierarchical numbering 
schemes presented in Section but both Z curves and Peano- 
Hilbeit curves are available in higher dimensions, so that this 
part does not present a problem. All other parts of the paper are 
based on inherently ID pixel indices and are not affected by di¬ 
mensionality changes except for a few constants. 
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Fig. 5. Size comparisons for survey-like (left column) and catalogue-like (right column) spherical coverages. The first row shows a typical repre¬ 
sentative for the respective coverage. The second, third, and fourth rows show histograms of size ratios for range sets in Z-order pixel numbering 
compared to multi-order lists, Peano-Hilbert RS compared to Z-order RS, and compressed to uncompressed Z-order RS, respectively. 
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